Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary Cooking Action
Ishat, Tahoshin Alam, Qayum, Mohammad Abdul
Abstract--This research investigates the feasibility of an intelligent, multi-modal AI system that interprets visual, audio, and motion-based data to analyse and comprehend cooking recipes. The system integrates object segmentation, hand motion classification, and audio-to-text conversion with natural language processing to create a comprehensive pipeline that imitates human-level understanding of kitchen tasks and recipes. The early stages of the project involved experimenting with pre-made datasets, specifically the COCO dataset for object segmentation, which yielded suboptimal results for the project's use case. To overcome this, a domain-specific dataset was curated by collecting and annotating over 7,000 kitchen-related images, later augmented to 17,000 images. Several YOLOv8 segmentation models were trained on this dataset to detect 16 essential kitchen objects. Additionally, short-duration videos capturing cooking actions were collected and processed using MediaPipe to extract hand, elbow, and shoulder keypoints. These were used to train an LSTM-based model for hand action classification. The pipeline also incorporates Whisper, an audio-to-text transcription model, and leverages a large language model such as TinyLlama to generate structured cooking recipes from the multi-modal inputs.
A. Background and motivation: In the era of computer vision, every crucial task in our day-to-day life is also being taken over by artificial intelligence and machines.
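As a rough illustration of the keypoint-to-action part of such a pipeline, a minimal sketch follows; this is not the authors' code, and the sequence length, joint set, layer sizes, and class count are illustrative assumptions. It extracts shoulder/elbow/wrist landmarks per frame with MediaPipe Pose and feeds the sequence to a small Keras LSTM classifier.

```python
# Sketch: per-frame shoulder/elbow/wrist keypoints via MediaPipe Pose,
# classified as a sequence by an LSTM. Hyperparameters are assumptions.
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras import layers, models

SEQ_LEN = 30                        # assumed frames per short action clip
N_CLASSES = 5                       # assumed number of cooking actions
JOINTS = [11, 12, 13, 14, 15, 16]   # MediaPipe indices: shoulders, elbows, wrists

def extract_keypoints(video_path: str) -> np.ndarray:
    """Return a (SEQ_LEN, len(JOINTS) * 3) array of x, y, z joint coordinates."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while len(frames) < SEQ_LEN:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                lm = result.pose_landmarks.landmark
                frames.append([c for j in JOINTS for c in (lm[j].x, lm[j].y, lm[j].z)])
    cap.release()
    # Pad short clips with zeros so every sample has the same length.
    while len(frames) < SEQ_LEN:
        frames.append([0.0] * (len(JOINTS) * 3))
    return np.asarray(frames, dtype=np.float32)

# A small sequence classifier over the per-frame keypoint vectors.
model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, len(JOINTS) * 3)),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```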
- Research Report (0.64)
- Workflow (0.46)
Real-World Cooking Robot System from Recipes Based on Food State Recognition Using Foundation Models and PDDL
Kanazawa, Naoaki, Kawaharazuka, Kento, Obinata, Yoshiki, Okada, Kei, Inaba, Masayuki
Although there is a growing demand for cooking behaviours as one of the expected tasks for robots, a series of cooking behaviours based on new recipe descriptions by robots in the real world has not yet been realised. In this study, we propose a robot system that integrates real-world executable robot cooking behaviour planning using the Large Language Model (LLM) and classical planning of PDDL descriptions, and food ingredient state recognition learning from a small number of data using the Vision-Language model (VLM). We succeeded in experiments in which PR2, a dual-armed wheeled robot, performed cooking from arranged new recipes in a real-world environment, and confirmed the effectiveness of the proposed system.
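To make the classical-planning side concrete, the sketch below writes out a toy PDDL cooking domain as a Python string and builds a matching problem from a list of ingredients. The predicates, operators, and names are illustrative assumptions, not the domain used in the paper.

```python
# Sketch: a toy PDDL cooking domain fragment, embedded as a Python string.
# Predicate and action names are illustrative, not the paper's actual domain.
TOY_DOMAIN = """
(define (domain toy-cooking)
  (:predicates (raw ?x) (cooked ?x) (in-pan ?x) (pan-free))
  (:action put-in-pan
    :parameters (?x)
    :precondition (and (raw ?x) (pan-free))
    :effect (and (in-pan ?x) (not (pan-free))))
  (:action fry
    :parameters (?x)
    :precondition (in-pan ?x)
    :effect (and (cooked ?x) (not (raw ?x)))))
"""

def make_problem(ingredients: list[str], goal: str) -> str:
    """Build a matching PDDL problem for a list of raw ingredients."""
    objs = " ".join(ingredients)
    init = " ".join(f"(raw {i})" for i in ingredients) + " (pan-free)"
    return (f"(define (problem toy-recipe) (:domain toy-cooking)\n"
            f"  (:objects {objs})\n  (:init {init})\n  (:goal (cooked {goal})))")

print(make_problem(["egg", "onion"], "egg"))
```

In the paper's setup the LLM would supply the recipe-specific goal and operator sequence while a classical planner searches this kind of symbolic description; the fragment above only shows the shape of that symbolic layer.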
PizzaCommonSense: Learning to Model Commonsense Reasoning about Intermediate Steps in Cooking Recipes
Diallo, Aissatou, Bikakis, Antonis, Dickens, Luke, Hunter, Anthony, Miller, Rob
Decoding the core of procedural texts, exemplified by cooking recipes, is crucial for intelligent reasoning and instruction automation. Procedural texts can be comprehensively defined as a sequential chain of steps to accomplish a task employing resources. From a cooking perspective, these instructions can be interpreted as a series of modifications to a food preparation, which initially comprises a set of ingredients. These changes involve transformations of comestible resources. For a model to effectively reason about cooking recipes, it must accurately discern and understand the inputs and outputs of intermediate steps within the recipe. Aiming to address this, we present a new corpus of cooking recipes enriched with descriptions of intermediate steps of the recipes that explicate the input and output for each step.
[Figure 1: A graphical depiction of the PizzaCommonSense motivation. Models are required to learn knowledge about the input and output of each intermediate step and predict the correct sequencing of these comestibles given the corresponding instructions.]
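The kind of intermediate-step reasoning the corpus targets can be pictured with a small data structure; this is a hypothetical sketch, and the field names are assumptions rather than the dataset's schema.

```python
# Sketch: a recipe step annotated with its input and output comestibles,
# plus a check that each step consumes what an earlier step produced.
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str
    inputs: set[str]    # comestibles the step consumes
    output: str         # comestible the step produces

RECIPE = [
    Step("mix flour, water and yeast", {"flour", "water", "yeast"}, "dough"),
    Step("spread sauce on the dough", {"dough", "tomato sauce"}, "sauced base"),
    Step("bake the pizza", {"sauced base", "cheese"}, "pizza"),
]

def check_flow(steps: list[Step], pantry: set[str]) -> bool:
    """Every input must be a pantry item or the output of a previous step."""
    available = set(pantry)
    for step in steps:
        if not step.inputs <= available:
            return False
        available -= step.inputs
        available.add(step.output)
    return True

print(check_flow(RECIPE, {"flour", "water", "yeast", "tomato sauce", "cheese"}))
```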
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Workflow (0.66)
- Research Report (0.64)
Cook-Gen: Robust Generative Modeling of Cooking Actions from Recipes
Venkataramanan, Revathy, Roy, Kaushik, Raj, Kanak, Prasad, Renjith, Zi, Yuxin, Narayanan, Vignesh, Sheth, Amit
As people become more aware of their food choices, food computation models have become increasingly popular in assisting people in maintaining healthy eating habits. For example, food recommendation systems analyze recipe instructions to assess nutritional contents and provide recipe recommendations. The recent and remarkable successes of generative AI methods, such as auto-regressive large language models, can lead to robust methods for a more comprehensive understanding of recipes for healthy food recommendations beyond surface-level nutrition content assessments. In this study, we explore the use of generative AI methods to extend current food computation models, primarily involving the analysis of nutrition and ingredients, to also incorporate cooking actions (e.g., add salt, fry the meat, boil the vegetables, etc.). Cooking actions are notoriously hard to model using statistical learning methods due to irregular data patterns - significantly varying natural language descriptions for the same action (e.g., marinate the meat vs. marinate the meat and leave overnight) and infrequently occurring patterns (e.g., add salt occurs far more frequently than marinating the meat). The prototypical approach to handling irregular data patterns is to increase the volume of data that the model ingests by orders of magnitude. Unfortunately, in the cooking domain, these problems are further compounded with larger data volumes presenting a unique challenge that is not easily handled by simply scaling up. In this work, we propose novel aggregation-based generative AI methods, Cook-Gen, that reliably generate cooking actions from recipes, despite difficulties with irregular data patterns, while also outperforming Large Language Models and other strong baselines.
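The irregular-pattern problem described above is easy to see in code: the same action surfaces under many natural-language variants, so a naive frequency model fragments its counts. The toy normalization step below is purely illustrative; Cook-Gen's actual aggregation method is learned and considerably more involved.

```python
# Sketch: collapsing varied surface forms of the same cooking action before
# counting them. The variant map is a toy stand-in for learned aggregation.
from collections import Counter

CANONICAL = {
    "marinate the meat": "marinate",
    "marinate the meat and leave overnight": "marinate",
    "add salt": "add-salt",
    "add a pinch of salt": "add-salt",
    "boil the vegetables": "boil",
}

def canonicalize(phrase: str) -> str:
    # Fall back to the first token when no variant is known.
    return CANONICAL.get(phrase.lower().strip(), phrase.split()[0].lower())

steps = ["Add salt", "add a pinch of salt", "marinate the meat and leave overnight"]
print(Counter(canonicalize(s) for s in steps))
# Counter({'add-salt': 2, 'marinate': 1})
```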
- North America > United States > South Carolina > Richland County > Columbia (0.15)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- Health & Medicine > Consumer Health (1.00)
- Education > Health & Safety > School Nutrition (0.48)
- Government > Regional Government > North America Government > United States Government (0.46)
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows
Shirai, Keisuke, Hashimoto, Atsushi, Nishimura, Taichi, Kameko, Hirotaka, Kurita, Shuhei, Ushiku, Yoshitaka, Mori, Shinsuke
We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn each cooking action result in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning and procedural text generation.
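The dataset's grounding of image pairs in a recipe flow graph can be sketched as a simple graph structure; the field names here are hypothetical, and the actual r-FG annotation scheme is richer.

```python
# Sketch: a recipe flow graph node holding a cooking action and the image
# pair (before/after) that shows the object's state change.
from dataclasses import dataclass, field

@dataclass
class FlowNode:
    action: str
    before_image: str          # path to the pre-action state image
    after_image: str           # path to the post-action state image
    successors: list["FlowNode"] = field(default_factory=list)

chop = FlowNode("chop the onion", "img/onion_whole.jpg", "img/onion_diced.jpg")
fry = FlowNode("fry the onion", "img/onion_diced.jpg", "img/onion_fried.jpg")
chop.successors.append(fry)   # the chopped onion flows into the frying step

# Walk the flow to list each action with its visual state change.
node = chop
while node:
    print(f"{node.action}: {node.before_image} -> {node.after_image}")
    node = node.successors[0] if node.successors else None
```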
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Workflow (0.88)
- Research Report (0.64)
Affect Sensing in Metaphorical Phenomena and Dramatic Interaction Context
Zhang, Li (Teesside University)
Metaphorical interpretation and affect detection using context profiles from open-ended text input are challenging in the field of affective language processing. In this paper, we explore recognition of a few typical affective metaphorical phenomena and context-based affect sensing, using the modeling of speakers' improvisational mood and other participants' emotional influence on the speaking character under the improvisation of loose scenarios. The overall updated affect detection module is embedded in an AI agent. The new developments have enabled the AI agent to perform generally better in affect sensing tasks. The work emphasizes the conference themes of affective dialogue processing, human-agent interaction and intelligent user interfaces.
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Staffordshire (0.04)
- Europe > United Kingdom > England > North Yorkshire > Middlesbrough (0.04)
- Europe > Spain > Canary Islands > Gran Canaria (0.04)